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Abstract 



The recent explosion in available genetic data has led to significant advances in understanding 
the demographic histories of and relationships among human populations. It is still a challenge, 
however, to infer reliable parameter values for complicated models involving many popula- 
tions. Here we present MixMapper, an efficient, interactive method for constructing phyloge- 
netic trees including admixture events using single nucleotide polymorphism (SNP) genotype 
data. MixMapper implements a novel two-phase approach to admixture inference using moment 
statistics, first building an unadmixed scaffold tree and then adding admixed populations by 
solving systems of equations that express allele frequency divergences in terms of mixture pa- 
rameters. Importantly, all features of the tree, including topology, sources of gene flow, branch 
lengths, and mixture proportions, are optimized automatically from the data and include esti- 
mates of statistical uncertainty. MixMapper also uses a new method to express branch lengths 
in easily interpretable drift units. We apply MixMapper to recently published data for HGDP 
individuals genotyped on a SNP array designed especially for use in population genetics stud- 
ies, obtaining confident results for 30 populations, 20 of them admixed. Notably, we confirm a 
signal of ancient admixture in European populations — including previously undetected admix- 
ture in Sardinians and Basques — involving a proportion of 20-40% ancient northern Eurasian 
ancestry. 
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Introduction 



The most basic way to represent the evolutionary history of species or populations is through 
a phylogenetic tree, a model that in its strict sense assumes that there is no gene flow between 



populations after they have diverged (Cavalli-Sforza and Edwards 1967). In many settings, 
however, groups that have split from one another can still exchange genetic material. This is cer- 
tainly the case for human population history, during the course of which populations have often 
diverged only incompletely or diverged and subsequently mixed again (Reich et al. , |2009 Pat- 



terson et al4|20T2t |Wall et al] [20091 pval et al.[ |20T0t |Gravel et alj [20TT] [Pugach et al4|20TT 



Green et al. 2010; Reich et al. , 2010). To capture these more complicated relationships, previ- 



ous studies have considered models allowing for continuous migration among populations (Wall 



et al. 2009 ; Laval et al. 2010[ |Gravel et al.[ 201 1 ) or have extended simple phylogenetic trees 



into admixture trees, in which populations on separate branches are allowed to re-merge and 



form an admixed offspring population (Reich et al. 2009, Patterson et al. 2012 Chikhi et al. 



2001 1 |Wang[ |2003| |Sousa et aTj |2009[ ). Both of these frameworks, of course, still represent 



substantial simplifications of true population histories, but they can help capture a range of new 
and interesting phenomena. 

Several approaches have previously been used to build phylogenetic trees incorporating ad- 
mixture events. First, likelihood methods ( |Chikhi et alj [200T| [Wangl [2003| [Sousa et al4|2009| ) 
use a full probabilistic evolutionary model, which allows a high level of precision with the dis- 
advantage of greatly increased computational cost. Consequently, likelihood methods can in 
practice only accommodate a small number of populations ( Wall et aL}|2009[ Laval et al. 2010 



Gravel et al.[ 2011 Siren et aL} 201 Moreover, the tree topology must generally be speci- 



fied in advance, meaning that only parameter values can be inferred and not the arrangement 



of populations in the tree. By contrast, moment-based methods (Reich et al. 2009, Patterson 



et al.[ 2012[ ) use only means and variances of allele frequency divergences. Moments are sim- 



pler conceptually and especially computationally, and they allow for more flexibility in model 
conditions. Their disadvantages can include reduced statistical power and difficulties in design- 



3 



ing precise estimators with desirable statistical properties (e.g., unbiasedness) (Wang} 2003). 
Finally, a number of studies have considered "phylogenetic networks," which generalize trees 
to include cycles and multiple edges between pairs of nodes and can be used to model popu- 
lation histories involving hybridization (Huson and Bryant, 2006, Yu et al. 2012). However, 
these methods also tend to be computationally expensive. 

In this work, we introduce MixMapper, a new computational tool that fits admixture trees by 
solving systems of moment equations involving the pairwise distance statistic f 2 (R eich et"aL| 



2009; Patterson et al. 2012), which is the average squared allele frequency difference between 



two populations. The theoretical expectation of f 2 can be calculated in terms of branch lengths 
and mixture fractions of an admixture tree and then compared to empirical data. MixMapper 
can be thought of as a generalization of the qpgraph package ( Patterson et al.[|2012 ), which 
takes as input genotype data, along with a proposed arrangement of admixed and unadmixed 
populations, and returns branch lengths and mixture fractions that produce the best fit to allele 
frequency moment statistics measured on the data. MixMapper, by contrast, performs the fit- 
ting in two stages, first constructing an unadmixed scaffold tree via neighbor-joining and then 
automatically optimizing the placement of admixed populations onto this initial tree. Thus, no 
topological relationships among populations need to be specified in advance. 



Our method is similar in spirit to the independently developed TreeMix method (Pickrell 



and Pritchard, 2012[ ). Like MixMapper, TreeMix builds admixture trees from second moments 



of allele frequency divergences, although it does so via a composite likelihood maximization 
approach made tractable with a multivariate normal approximation. Procedurally, TreeMix is 
structured in a "top-down" fashion, whereby a full set of populations is initially fit as an unad- 
mixed tree, and gene flow edges are added sequentially to account for the greatest errors in the 



fit (Pickrell and Pritchard 2012). This format makes TreeMix well- suited to handling very large 
trees: the entire fitting process is automated and can include arbitrarily many admixture events 
simultaneously. In contrast, MixMapper is designed as an interactive tool to maximize flexibil- 
ity and precision with a "bottom-up" approach, beginning with a carefully screened unadmixed 
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scaffold tree to which admixed populations are added with best-fitting parameter solutions. 
We use MixMapper to model the ancestral relationships among 52 populations from the 



CEPH-Human Genome Diversity Cell Line Panel (HGDP) (Rosenberg etaLj 2002, Li et al. 



2008] ) using recently published data from a new, specially ascertained SNP array designed for 
population genetics applications (Keinan "et al. (12007} |Patterson et al.[|2012[ ). Previous studies 



of these populations have built simple phylogenetic trees (Li et al.[ 2008; Siren et al. 2011 1, 



identified a substantial number of admixed populations with likely ancestors (Patterson et al. 



20 1 2 ) , and constructed a large- scale admixture tree ( Pickrell and Pritchard 20 1 2 ) . Here, we add 
an additional level of quantitative detail, obtaining best-fit admixture parameters and bootstrap 
error estimates for 30 HGDP populations, of which 20 are admixed. The results include, most 



notably, a significant admixture event in the history of all sampled European populations (Pat 



terson et al. , 2012), among them Sardinians and Basques. 



Methods 



Problem setup and model assumptions 

The basic problem we consider is as follows: given an array of genotype (i.e., SNP) data sam- 
pled from a set of individuals grouped by population, what can we infer about the phylogeny 
and admixture history of these populations? We assume that all SNPs are neutral, biallelic, and 
autosomal, and that divergence times are short enough that there are no double mutations. Thus, 
allele frequency variation — the signal that we harness — is governed entirely by genetic drift and 
admixture. We model admixture as a one-time exchange of genetic material: two parent pop- 
ulations mix to form a single descendant population whose allele frequencies are a weighted 
average of the parents'. This model is of course a very rough rendition of true mixture events, 
but it is flexible enough to serve as a reasonable first-order approximation and lends itself to 



efficient analysis using /-statistics (Reich et al. 2009; Patterson et al. 2012). 
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Constructing an unadmixed scaffold tree 

Our MixMapper admixture-tree-building procedure consists of two phases (Figure [T]), the first 
of which selects a set of unadmixed populations to use as a scaffold tree. We begin by computing 



/ 3 statistics (Reich et al. 2009 Patterson et al. , 2012) for all triples of populations Pi, P 2 , P 



in the data set and removing those populations P 3 with any negative values f 3 (P 3 ; Pi, P 2 ), 
which indicate admixture. We then use pairwise f 2 statistics to build a neighbor-joining tree 
on every subset of the remaining populations. In the absence of admixture, f 2 distances are 
additive along a phylogenetic tree (Text|ST[ cf. |Patterson et al.| ( |2012[ )), meaning that neighbor- 



joining should recover a tree with leaf-to-leaf distances that are completely consistent with the 



pairwise f 2 data (Sa itou and Nei[ |1987| ). We therefore evaluate the quality of each putative 
unadmixed tree according to its maximum error between fitted and actual pairwise distances. 
Because of model violation in real data, trees built on smaller subsets are more additive, but 
they are also less informative; in particular, it is beneficial to include populations from as many 
continental groups as possible in order to maximize the utility of the scaffold for admixture 
fitting. MixMapper provides a ranking of trees by additivity as a guide from which the user 
chooses a suitable unadmixed tree. Finally, MixMapper adjusts the chosen tree by re-optimizing 
its branch lengths (maintaining the topology inferred from neighbor-joining) to minimize the 
sum of squared errors of all pairwise f 2 distances. 

Two-way admixture fitting 

The second phase of MixMapper begins by attempting to fit additional populations indepen- 
dently as simple two-way admixtures between branches of the unadmixed tree (Figure [T]). As- 
suming for the moment that we know the branches from which the ancestral mixing populations 
split for a given admixed population, we can construct a system of equations of f 2 statistics that 



allows us to infer parameters of the mixture (Text SI). Specifically, the squared allele fre- 
quency divergence between the admixed population and each unadmixed population X' can be 
expressed as an algebraic combination of known branch lengths along with four unknown mix- 



ture parameters: the locations of the split points on the two parental branches, the combined 
terminal branch length, and the mixture fraction (Figure [2j\). To solve for the four unknowns, 
we need at least four unadmixed populations X' that produce a system of four independent 
constraints on the parameters. This condition is satisfied if and only if the data set contains two 
populations X{ and X' 2 that branch from different points along the lineage connecting the di- 
vergence points of the parent populations from the unadmixed tree (Text [ST). If the unadmixed 
tree contains n > 4 populations, we obtain a system of n equations in the four unknowns that 
in theory is dependent. In practice, the equations are in fact slightly inconsistent because of 
noise in the f 2 statistics and error in the point- admixture model, so we perform least-squares 
optimization to solve for the unknowns; having more populations helps reduce the impact of 
noise. 

Algorithmically, MixMapper thus performs two-way admixture fitting by iteratively testing 
each pair of branches of the unadmixed tree as possible sources for the ancestral mixing pop- 
ulations. For each choice of branches, MixMapper builds the implied system of equations and 
finds the least-squares solution (under the constraints that unknown branch lengths are nonneg- 
ative and the mixture fraction a is between and 1), ultimately choosing the pair of branches 
and mixture parameters producing the smallest residual norm. Our procedure for optimizing 
each system of equations uses the observation that upon fixing a, the system becomes linear in 
the remaining three variables (Text[ST|). Thus, we can optimize the system by performing con- 
strained linear least squares within a basic one-parameter optimization routine over a 6 [0, 1]. 
To implement this approach, we applied MATLAB's lsqlin and fminbnd functions with a 
few auxiliary tricks to improve computational efficiency (detailed in the code). 

Three-way admixture fitting 

MixMapper also fits three-way admixtures, i.e., those for which one parent population is itself 
admixed (Figure [2^). Explicitly, after an admixed population M 1 has been added to the tree, 
MixMapper can fit an additional user- specified admixed population M 2 as a mixture between 
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the Mi terminal branch and another (unknown) branch of the unadmixed tree. The fitting al- 
gorithm proceeds in a manner analogous to the two-way mixture case: MixMapper iterates 
through each possible choice of the third branch, optimizing each implied system of equations 
expressing f 2 distances in terms of mixture parameters. With two mixed populations, there 
are now 2n + 1 equations — f 2 {Mi, X') and f 2 (M 2 , X') for all unadmixed populations X' , and 
a ls° f2{Mi, M 2 ) — and eight unknowns: two mixture fractions, a\ and a 2 , and six linear branch 
length parameters (Figure |2j3). Fixing «i and a 2 results in a linear system as before, so we 
perform the optimization using MATLAB's lsqlin within fminsearch applied to a± and 
a 2 in tandem. The same mathematical framework extends to optimizing the placement of pop- 
ulations with arbitrarily many ancestral waves of admixture, but for simplicity and to reduce the 
risk of overfitting, we chose to limit this version of MixMapper to three-way admixtures. 

Expressing branch lengths in drift units 

All of the tree-fitting computations described thus far are performed using pairwise distances 
in f 2 units, which are mathematically convenient to work with owing to their additivity along 
a lineage (in the absence of admixture). However, f 2 distances are not directly interpretable in 
the same way as genetic drift D, which is a simple function of time and population size: 

D « 1 - exp(-t/2N e ) « 2 ■ F ST , 



where t is the number of generations and N e is the effective population size (Nei, 1987). To 
convert f 2 distances to drift units, we apply a new formula, dividing the f 2 -\ength of each 
branch by a heterozygosity value that we infer for the ancestral population at the top of the 
branch (Text|S2|). Qualitatively speaking, this conversion corrects for the relative stretching of 
f 2 branches at different portions of the tree as a function of heterozygosity ( Patterson et al.[ 



2012). In order to infer ancestral heterozygosity values accurately, it is important to use SNPs 



that are ascertained in an outgroup to the populations involved, which we address further below. 
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Before inferring heterozygosities at ancestral nodes of the unadmixed tree, we must first 
determine the location of the root (which is neither specified by neighbor-joining nor involved 
in the preceding analyses). MixMapper does so by iterating through branches of the unadmixed 
tree, temporarily rooting the tree along each branch, and then checking for consistency of the 
resulting heterozygosity estimates. Explicitly, for each internal node P, we split its present-day 
descendants (according to the rerooted tree) into two groups G\ and G 2 according to which 
child branch of P they descend from. For each pair of descendants, one from G\ and one 



from G2, we compute an inferred heterozygosity at P (Text S2). If the tree is rooted properly, 
these inferred heterozygosities are consistent, but if not, there exist nodes P for which the 
heterozygosity estimates conflict. MixMapper is thus able to infer the location of the root as 
well as the ancestral heterozygosity at each internal node, after which it applies the drift length 
conversion as a post-processing step on fitted f 2 branch lengths. 

Bootstrapping 

In order to measure the statistical significance of our parameter estimates, we compute bootstrap 
confidence intervals (Efronj 1979, Efron and Tibshirani 1986) for the inferred branch lengths 



and mixture fractions. Our bootstrap procedure is designed to account for both the randomness 
of the drift process at each of a finite number of SNPs and the random choice of individuals to 
represent each population. First, we divide the genome into 50 evenly-sized blocks, with the 
premise that this scale should easily be larger than that of linkage disequilibrium among our 
SNPs. Then, for each of 500 replicates, we resample the data set by (a) selecting 50 of these 
SNP blocks at random with replacement; and (b) for each population group, selecting a random 
set of individuals with replacement, preserving the number of individuals in the group. 

For each replicate, we recalculate all pairwise f 2 distances and present-day heterozygosity 
values using the resampled SNPs and individuals (adjusting the bias-correction terms to ac- 
count for the repetition of individuals) and then construct the admixture tree of interest. Even 
though the mixture parameters we estimate (branch lengths and mixture fractions) depend in 
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complicated ways on many different random variables, we can directly apply the nonparametric 
bootstrap to obtain confidence intervals (Efron and Tibshirani, 1986). For simplicity, we use a 



percentile bootstrap; thus, our 95% confidence intervals indicate 2.5 and 97.5 percentiles of the 
distribution of each parameter among the replicate trees. 

Computationally, we parallelize MixMapper's mixture-fitting over the bootstrap replicates 
using MATLAB's Parallel Computing Toolbox. 

Evaluating fit quality 

We use several criteria to evaluate the mixture fits produced by MixMapper and distinguish 
high-confidence results from possible artifacts of overfitting. 

First, we can compare MixMapper results to information obtained from other methods, such 
as the 3-population test ( |Reich et aL}|2009} [Patterson et al.[|2012| ). Negative / 3 values for a given 



target population indicate robustly that the population is admixed, and comparing / 3 statistics 
for different reference pairs can give useful clues about the ancestral mixing populations. Thus, 
while the 3-population test relies on similar data to MixMapper, its simpler form makes it useful 
for confirmation. 

Second, the consistency of parameter values over bootstrap replicates gives an indication 
of the robustness of the admixture fit in question. All of our results have some amount of as- 
sociated uncertainty, but we dismiss those with particularly large error bars. Most often, this 
phenomenon is manifested in the placement of ancestral admixing populations: for poorly fit- 
ting admixtures, these will often move between different branches of the tree from one replicate 
to the next, signaling unreliable results. 

Third, we place less faith in results with highly skewed values of a. We expect that if we 
try to fit a non-admixed population as an admixture, MixMapper should return a closely related 
population as the first branch with mixture fraction a ~ 1 (and an arbitrary second branch). 
Indeed, we often observe this pattern, although we also find that the second branch tends to 
have a mixture fraction of at least a few percent. We expect overfitting is likely in these cases. 
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Fourth, for any inferred admixture event, the two mixing populations must be contempora- 
neous. Since we cannot resolve the three pieces of terminal drift lengths leading to admixed 
populations (Figure ]2]\) and our branch lengths depend both on population size and absolute 
time, we cannot say for sure whether this property is satisfied for any given mixture fit. In some 
cases, however, it is clear that no realization of the variables could possibly be consistent: if we 
infer an admixture between a very recent branch and a very old one with a small value of the 
terminal drift c, then we can confidently say the mixture is unreasonable. 



Data set 



We analyzed a SNP data set from 934 HGDP individuals grouped in 53 populations ( |Rosenberg 
et al.[ 2002] Li et al.[ 2008). Unlike most previous studies of the HGDP samples, however, we 
worked with recently published data generated using the new Affymetrix Axiom Human Ori- 
gins Array ( |Patterson et al.[ 2012), which was designed with a simple ascertainment scheme for 
accurate population genetic inference (Keinan et al. |2007| ). It is well known that ascertainment 



bias can cause errors in estimated divergences among populations (Clark et ah] 2005, Albrecht 



sen et al. 2010), since choosing SNPs based on their properties in modern populations induces 



non-neutral spectra in related samples. While there do exist methods to correct for ascertain- 
ment bias ( [Nielsen et al. 12004] ), it is much more desirable to work with a priori bias-free data, 
especially given that typical SNP arrays are designed using opaque ascertainment schemes. 
To avoid these pitfalls, we used Panel 4 of the new array, which consists of 163,313 SNPs 



that were ascertained as heterozygous in the genome of a San individual (Keinan et al. 2007). 
This panel is special because there is evidence that the San are approximately an outgroup to 
all other modern-day human populations (Li et al. , 2008] Gronau et al.] 2011] ). Thus, while the 
Panel 4 ascertainment scheme distorts the San allele frequency spectrum, it is nearly neutral 
with respect to all other populations. In other words, we can think of the ascertainment as 
effectively choosing a set of SNPs (biased toward San heterozygosity) at the common ancestor 
of the remaining 52 populations, after which drift occurs in a bias-free manner. We excluded 
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61,369 SNPs that are annotated as falling between the transcription start site and end site of a 



gene in the UCSC Genome Browser database (Fujita et al. , 201 1 ). Most of the excluded SNPs 
are not within actual exons, but, as expected, the frequency spectra at these "gene region" loci 
were slightly shifted toward fixed classes relative to other SNPs, indicative of the action of 



natural selection (Figure SI ). Since we assume neutrality in all of our analyses, we chose to 
remove these SNPs. 



Results 

Applying MixMapper to the HGDP populations, we built an unadmixed scaffold tree with 10 
populations, upon which 20 more populations fit robustly as admixtures (Figure[3]). We describe 
these results in detail below. 

Unadmixed phylogeny inferred for 10 HGDP populations 

The first phase of our MixMapper analysis identified an (approximately) unadmixed scaffold 
tree containing 10 populations: Dai, Japanese, Karitiana, Lahu, Mandenka, Naxi, Papuan, 
Sunn, Yi, and Yoruba (Figure [3jB). Despite the focus of the HGDP on isolated populations, 
most of the 53 HGDP groups exhibit signs of admixture detectable by the 3-population test, as 



has been noted previously (Patterson et al. 2012). Our initial filtering step of removing pop 



ulations with negative values / 3 values (indicative of recent admixture) left only 20 that are 
potentially unadmixed. Furthermore, most subsets including even half of those 20 populations 
exhibited significant divergence from the / 2 " a dditivity that should hold in the case of pure drift 
(Text [SlJ cf . |Patterson et ah] pOHj )). 



As mentioned above, it is desirable to include a wide range of populations in the scaffold 
tree to provide both geographic coverage and extra equations in order to facilitate the fitting 
of admixed populations. Additionally, including at least four continental groups provides a 
fairer evaluation of additivity; with three or fewer, any quartet of populations must contain 
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at least two that are closely related. At the same time, including too many populations can 
compromise accuracy. The populations we selected form one of the most additive 10-population 
subsets representing at least four of the five major continental groups (Africa, Americas, Asia, 
Europe, Oceania) in the HGDP data set. To check that their placement in an unadmixed tree is 
reasonable, we confirmed that none of the 10 populations can be fit in a reasonable way as an 
admixture on a tree built with the other nine (data not shown). 



Ancient admixture in the history of present-day European populations 

A notable feature of our unadmixed scaffold tree is that it does not contain any European popu- 
lations. We had ruled out including any HGDP Europeans other than Sardinian and Basque on 



the basis of at least one significantly negative f 3 value, as has been previously reported (Patter 



son et al.| |2012[ ). Moreover, we found that potential / 2 -distance trees containing Sardinian or 
Basque along with representatives of at least three other continents were noticeably less addi- 
tive than four-continent trees of the same size without Europeans. For example, on a set of 16 
potentially unadmixed populations, none of the 100 most additive 10-population trees include 
Europeans. This points to the presence of admixture in Sardinian and Basque as well as the 
other European populations. 

Using MixMapper, we added each of the European populations to the unadmixed tree via 
admixtures (Figure |4| Table [T]). For all eight groups in the HGDP data set, the best fit was as 
a mixture of a population related to the common ancestor of Karitiana and Suruf (in varying 
proportions of about 20-40%, with Sardinian and Basque among the lowest and Russian the 
highest) with a population related to the common ancestor of all unadmixed non- African pop- 
ulations on the tree. All eight European populations were fit independently, but notably, their 
ancestors were found to branch from the scaffold tree at very similar points, suggesting a sim- 
ilar broad-scale history. Their branch positions are also qualitatively consistent with previous 
work that used the 3-population test to deduce ancient admixture for Europeans other than Sar- 



dinian and Basque (Patterson et al. 2012). To confirm the signal in Sardinian and Basque, we 
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re-computed their mixture proportions using / 4 ratio estimation (Reich et al.[ |2009] |Patterson 



et al.[ 2012), which like MixMapper uses allele frequency statistics but in a very simple and in- 



tuitive framework. We estimated approximately 20-25% "ancient northern Eurasian" ancestry 
(Table SI ), which is in very good agreement with our findings from MixMapper (Table [T]). 

At first glance, this inferred admixture might appear improbable on geographical and chrono- 
logical grounds, but importantly, the two ancestral branch positions do not represent the mixing 
populations themselves. Rather, there may be substantial drift from the best-fit branches to the 
true mixing populations, indicated as branch lengths a, b, and c in Figure [4]A Unfortunately, 
these three lengths appear only in a fixed linear combination in the system of / 2 equations 



(Text SI ), and current methods can only give estimates of this linear combination rather than 



the individual values (Pa tterson et al.[|2012| ). One plausible arrangement, however, is shown in 
Figure |4]\ for the case of Sardinian. 



Two-way admixtures outside of Europe 

We also found several other populations that fit robustly onto the unadmixed tree using simple 
two-way admixtures (Table [2]). All of these can be identified as admixed using the 3-population 
or 4-population tests ( |Patterson et al.[|20T2"| ), but with MixMapper, we were able to provide the 
full set of best-fit parameter values to place them onto an admixture tree. 

First, we found that four populations from North-Central and Northeast Asia — Daur, Hezhen, 
Oroqen, and Yakut — are likely descended from admixtures between native North Asian popu- 
lations and East Asian populations related to Japanese. The first three are estimated to have 
roughly 10-30% North Asian ancestry, while Yakut has 50-70%. Melanesians fit optimally as a 
mixture of a Papuan-related population with an East Asian population close to Dai, in a propor- 
tion of roughly 80% Papuan-related, similar to previous estimates (Reich et al. |2011 ; Xu et al.| 
2012[ ). Finally, we found that Han Chinese have an optimal placement as an approximately 
equal mixture of two ancestral East Asian populations, one related to modern Dai (likely more 
southerly) and one related to modern Japanese (likely more northerly), corroborating a previous 
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finding of admixture in Han populations between northern and southern clusters in a large-scale 
analysis of East Asia (|HUGO Pan- Asian SNP Consortium] |2009j). 



Recent admixtures involving western Eurasians 

Finally, we inferred the branch positions of several populations that are well known to be re- 
cently admixed (cf. Patterson et al.| ( 2012[ ); Pickrell and Pritchard (2012)) but for which one 



ancestral mixing population was itself anciently admixed in a similar way to Europeans. To do 
so, we exploited the capability of MixMapper to fit three-way admixtures (Figure [2j3), using 
the anciently admixed branch leading to Sardinian as one ancestral source branch. First, we 
found that Mozabite, Bedouin, Palestinian, and Druze, in decreasing order of African ancestry, 
are all optimally represented as a mixture between an admixed western Eurasian population 
(not necessarily European) related to Sardinian and an African population (Table [3]). We also 
obtained good fits for Uygur and Hazara as mixtures between a western Eurasian population 
and a population related to the common ancestor of all East Asians on the tree (Table [3]). 

Comparison to results from TreeMix 

MixMapper provides a nuanced view of human population relationships, refining approximate 
phylogenetic tree models by incorporating numerous ancestral admixture events (Figure [3]). 
Our results are similar in spirit to an admixture tree for HGDP populations produced with the 



TreeMix software (Pickrell and Pritchard 2012) (Figure S2 ), but there are also important differ 



ences. Both methods fit Palestinian, Bedouin, Druze, Mozabite, Uygur, and Hazara as admix- 
tures, but MixMapper analysis suggests that these populations are better-modeled as three-way 
admixed. TreeMix fits Brahui, Makrani, Cambodian, and Maya — all of which the 3-population 
test identifies as admixed but we are unable to place reliably with MixMapper — while MixMap- 
per confidently fits Daur, Hezhen, Oroqen, Yakut, Melanesian, and Han. Perhaps most notably, 
TreeMix, unlike MixMapper, does not infer a widespread ancient admixture for Europeans, 
although it does identify gene flow from an ancestor of Native Americans to Russians and 
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from Orcadian to an ancestor of Americans, which could reflect the same historical signal (Fig- 
ure [52). 

MixMapper and TreeMix also differ in their coverage of populations ultimately modeled. 
With MixMapper, we chose to create admixture trees involving only pre-selected approximately 
unadmixed populations, upon which admixed populations of interest are added on a case-by- 
case basis only if they fit reliably as two- or three-way admixtures. In contrast, TreeMix returns 
a single large-scale admixture tree containing all populations in the data set, which may in- 
clude some that can be shown to be admixed but are not modeled as such. Overall, we view 
MixMapper as "semi-automated" compared to TreeMix, which is almost fully automated. Both 
approaches have benefits: ours allows more manual guidance and lends itself to interactive use, 
whereas TreeMix requires less user intervention, although some care must be taken in choosing 



the number of gene flow events to include (10 in Figure S2) to avoid creating spurious mix- 
tures. Finally, while TreeMix is quite efficient, running its HGDP analysis in a matter of hours, 
MixMapper runs analyses of individual populations on a faster time scale that promotes rapid 
interactive investigation. After initial setup to compute allele frequency statistics and potential 
scaffold trees, MixMapper determines the best-fit admixture model for a chosen population in 
a few seconds. We have found this aspect of the software useful for exploring the reliability of 
mixture fits and their sensitivity to assumptions; for example, it is easy to test very quickly the 
effect of tweaking the composition of the scaffold tree and to experiment with fitting populations 
of interest as two- or three-way admixtures. 

Estimation of ancestral heterozygosity 

A significant advantage of using data free from ascertainment bias is that it enables us to com- 
pute accurate estimates of the heterozygosity (over a given set of SNPs) throughout an unad- 
mixed tree, including at ancestral nodes. This in turn allows us to convert branch lengths from 
f 2 units to easily interpretable drift lengths, as discussed above and in Text S2 In Figure [5p, 



we show our estimates for the heterozygosity (averaged over all San-ascertained SNPs used) at 
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the common ancestor of each pair of present-day populations in the tree. Consensus values are 
given at the nodes of Figure [5]A.. The imputed heterozygosity should be the same for each pair 
of descendants having the same common ancestor, and indeed, with the new data set, the agree- 
ment is excellent (Figure |5p). By contrast, inferences of ancestral heterozygosity are much less 
accurate using HGDP data from the original Illumina SNP array ( |Li et al.[ |2008| ) because of 



ascertainment bias (Figure |5jB); f 2 statistics are also affected but to a lesser degree (Figure [S3]) 



as previously demonstrated (Patt erson et al.[ [2QT2| ). We used these heterozygosity estimates to 



express branch lengths of all of our trees in drift units (see Text S2 ). 



Discussion 

The MixMapper framework generalizes and automates several previous moment-based admix- 
ture inference methods, incorporating them as special cases and enabling comprehensive testing 
across many potential admixture scenarios. Mathematically, MixMapper sets up and solves a 
system of equations relating unknown mixture parameters to a complete list of measured pair- 
wise f 2 statistics. Methods such as the three-population test for admixture and / 4 ratio esti- 
mation ( |Reich et~aL 2009 ; Patterson et al. , 2012[ ) have similar theoretical underpinnings, as 



they make inferences on small subsets of populations using / 3 and / 4 statistics, which can be 
expressed as linear combinations of / 2 statistics. The main benefit of MixMapper is that by ana- 
lyzing more populations simultaneously and automatically considering different tree topologies 
and sources of gene flow, it produces results that can be easier to interpret and can uncover 
unexpected admixture events. 

For example, negative f 3 values — i.e., three-population tests indicating admixture — can be 
expressed in terms of relationships among f 2 distances between populations in an admixture 
tree. In general, three-population tests can be somewhat difficult to interpret because the sur- 
rogate parent populations may not in fact be closely related to the true participants in the ad- 
mixture, e.g., in the "outgroup case" ( |Reich et al.[ 2009, Patterson et al.[ 2012[ ). The relations 



among the f 2 statistics incorporate this situation naturally, however, and solving the full system 

17 



recovers the true branch points wherever they are. As another example, f± ratio estimation in- 
fers mixture proportions of a single admixture event from / 4 statistics involving the admixed 
population and four unadmixed populations situated in a particular topology (Reich et al. 2009 



Patterson et al.[ 2012[ ). Whenever data for five such populations are available, the system of all 



/ 2 equations that MixMapper solves to obtain the mixture fraction becomes equivalent to the 
/ 4 ratio computation. More importantly, because MixMapper infers all of the topological re- 
lationships within an admixture tree automatically by optimizing the solution of the distance 
equations over all branches, we do not need to specify in advance where the admixture took 
place — which is not always obvious a priori. By using more than five populations, MixMapper 
also benefits from more data points to constrain the fit. 

Due in part to this generality of the MixMapper approach, we were able to obtain the no- 
table result that all European populations in the HGDP are optimally represented as mixtures 
between a population related to the common ancestor of Americans and a population related 
to the common ancestor of all non- African populations in our unadmixed tree, confirming and 
extending an admixture signal first reported by Patterson et al. ( 2012[ ). Our interpretation is 
that most if not all modern Europeans are descended from at least one large-scale ancient ad- 
mixture event involving, in some combination, at least one population of Mesolithic European 
hunter-gatherers; Neolithic farmers, originally from the Near East; and/or other migrants from 
northern or Central Asia. Either the first or second of these could be related to the "ancient 
western Eurasian" branch in Figure |4} and either the first or third could be related to the "an- 
cient northern Eurasian" branch. Present-day Europeans differ in the amount of drift they have 
experienced since the admixture and in the proportions of the ancestry components they have 
inherited, but their overall profiles are similar. 

Our results for Europeans are consistent with several previously published lines of evi- 



dence ( |Pinhasi et al.[ |2012[ ). First, it has long been hypothesized, based on analysis of a few 
genetic loci (especially on the Y chromosome), that Europeans are descended from ancient ad- 
mixtures ( |Semino et al.[ |2000| |Dupanloup et al.[ |2004] |Soares et"aL| |2010[ ). Ancient European 
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admixture has been studied with genome- wide statistics in Patterson et al. ( 2012[ ) with largely 
similar conclusions. Our results also suggest an interpretation for a previously unexplained 
frappe analysis of worldwide human population structure (using K = 4 clusters) showing that 
almost all Europeans contain a small fraction of American-related ancestry ( Li et al.[ 2008| ). 
Finally, sequencing of ancient DNA has revealed substantial differentiation in Neolithic Europe 



between farmers and hunter-gatherers (Bramanti et al. 2009), with the former more closely re- 



lated to present-day Middle Easterners (Haaketal.2010) and southern Europeans ( Keller etal. 



2012, Skoglund et al.[ 2012) and the latter more similar to northern Europeans (Skoglund et al. 



2012[ ), a pattern perhaps reflected in our observed northwest- southeast cline in the proportion 



of "ancient northern Eurasian" ancestry (Table [T]). Further analysis of ancient DNA may help 
shed more light on the sources of ancestry of modern Europeans. 

One important new insight of our European analysis is that we detect the same signal of 
admixture in Sardinian and Basque as in the rest of Europe. As discussed above, unlike other 
Europeans, Sardinian and Basque cannot be confirmed to be admixed using the 3-population test 



(as in Patterson et al. (2012)), likely due to a combination of less "ancient northern Eurasian" 
ancestry and more genetic drift since the admixture (Table [T]). The first point is further compli- 
cated by the fact that we have no unadmixed "ancient western Eurasian" population available to 
use as a reference; indeed, Sardinians themselves are often taken to be such a reference. How- 
ever, MixMapper uncovered strong evidence for admixture in Sardinian and Basque through 
additivity-checking in the first phase and automatic topology optimization in the second phase, 
discovering an arrangement of unadmixed populations enabling admixture parameter inference, 
which we then verified directly with / 4 ratio estimation. Perhaps the most convincing evidence 
of the robustness of this inference is the fact that MixMapper infers branch points for the an- 
cestral mixing populations that are very similar to those of other Europeans (Table [TJ), a concor- 
dance that is most parsimoniously explained by a shared history of ancient admixture among 
Sardinian, Basque, and other European populations. Finally, we note that because we fit all Eu- 
ropean populations directly onto our scaffold tree without assuming Sardinian or Basque to be 
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an unadmixed reference, our estimates of the "ancient northern Eurasian" ancestry proportions 



in Europeans are larger than those in Patterson et al. (2012) and we believe more accurate than 



those previously reported (Skoglu nd et al.[|2012||Patterson et aLH2012[ ). 

It is worth noting that of the 52 populations (excluding San) in the HGDP data set, there were 
22 that we were unable to fit in a reasonable way either on the unadmixed tree or as admixtures. 
In part, this was because our simple point- admixture model is intrinsically limited in its ability 
to capture complicated population histories. Most parts of the world have surely witnessed low 
levels of inter-population migration over time, especially between nearby populations, making 
it difficult to fit admixture trees to the data. We also found cases where having data from more 
populations would help the fitting process, for example for three-way admixed populations such 
as Maya where we do not have a sampled group with a simpler admixture history that could be 
used to represent two of the three components. Similarly, we found that while Central Asian 
populations such as Burusho, Pathan, and Sindhi have clear signals of admixture from the 3- 
population test, they likely have ancestry from several different sources (including sub-Saharan 
Africa in some instances), making them difficult to fit with MixMapper. Finally, we have chosen 
here to disregard admixture with archaic humans, which is known to be a small but noticeable 
component for most populations in the HGDP (Green et al. 20 10[ [Reich et al. , 2010). 

In certain applications, full genome sequences are beginning to replace more limited geno- 
type data sets such as ours, but we believe that our methods and SNP-based inference more 
generally will still be valuable in the future. Despite the increasing feasibility of sequencing, 
it is still much easier and less expensive to genotype samples using a SNP array, and with 
over 100,000 loci, the data used in this study provide substantial statistical power. Addition- 
ally, sequencing technology is currently more error-prone, which can lead to biases in allele 



frequency-based statistics ( |Pool et aLj |2010[ ): for example, rare alleles can be difficult to dis- 
tinguish from incorrect base calls, meaning that error correction will tend to flatten empirical 
frequency spectra. We expect MixMapper will continue to contribute to an important niche of 
population history inference methods based on SNP allele frequency data. 
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Software 

Source code for the MixMapper software is available at |http : //groups . csail .mit . 
edu/cb/mixmapper/[ 
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Figures 



Phase 1: Unadmixed scaffold 
tree construction 



Unadmixed population filtering 
(/3-statistics > 0) 
Unadmixed subset selection 
(ranking by f 2 -additivity) 
Scaffold tree building 
(neighbor joining) 



Phase 2: Admixture fitting 
► 



r 



Two-way mixture fitting 
• Three-way mixture fitting 
. • Conversion to drift units 

Bootstrap resampling 



Figure 1. MixMapper workflow. MixMapper takes as input an array of SNP calls annotated 
with the population to which each individual belongs. The method then proceeds in two 
phases, first building a tree of (approximately) unadmixed populations and then attempting to 
fit the remaining populations as admixtures. In the first phase, MixMapper produces a ranking 
of possible unadmixed trees in order of deviation from / 2 -additivity; based on this list, the user 
selects a tree to use as a scaffold. In the second phase, MixMapper tries to fit remaining 
populations as two- or three-way mixtures between branches of the unadmixed tree. In each 
case MixMapper produces an ensemble of predictions via bootstrap resampling, enabling 
confidence estimation for inferred results. 



26 



Branch"! Loc (Pre-Split / Total) 



B 



AdmixedPop 



, ; (a ■ Parentl + p ■ Parent2) 
MixedDrift = a 2 a+$ 2 b+c 



Branch2Loc 
(Pre-Split / Total) 



MixedDriftIA 



FinalDriftlB 

AdmixedPopI 



■L 



MixedDrift2 
AdmixedPop2 



Branch3Loc 
(Pre-Split / Total) 



Figure 2. Schematic of mixture parameters fit by MixMapper. (A) A simple two-way 
admixture. MixMapper infers four parameters when fitting a given population as an admixture. 
It finds the optimal pair of branches between which to place the admixture and reports the 
following: Branch 1 Loc and Branch2Loc are the points at which the mixing populations split 
from these branches; a is the proportion of ancestry from Branch 1 and (3 = 1 — a is the 
proportion from Branch2; and MixedDrift is the linear combination of drift lengths 
a 2 a + (3 2 b + c. (B) A three-way mixture: here AdmixedPop2 is modeled as an admixture 
between AdmixedPopI and Branch3. There are now four additional parameters; three are 
analogous to the above, namely, Branch3Loc, a 2 , and MixedDrift2. The remaining degree of 
freedom is the position of the split along the AdmixedPopI branch, which divides MixedDrift 
into MixedDriftIA and FinalDriftlB. 
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Figure 3. Aggregate phylogenetic trees of HGDP populations with and without 
admixture. (A) A simple neighbor-joining tree on the 30 populations for which MixMapper 



produced high-confidence results. This tree is analogous to the one given by Li et al. (2008 



Figure IB), and the topology is very similar (for the sake of comparison, we use the same color 
scheme). (B) Results from MixMapper. The populations appear in roughly the same order, but 



the majority are inferred to be admixed, as represented by dashed lines (cf. Pickrell and 



Pritchard (2012) and Figure S2). The admixed populations can be grouped into six categories, 



as indicated. Note that drift units are not additive, so branch lengths should be measured 
individually. 
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Figure 4. Inferred anceint admixture in Europe. (A) Detail of the inferred ancestral 
admixture for Sardinians (other European populations are similar). One mixing population 
splits from the unadmixed tree along the common ancestor branch of Americans ("Ancient 
Northern Eurasian") and the other along the common ancestor branch of all non- Africans 
("Ancient Western Eurasian"). Median parameter values are shown; 95% bootstrap confidence 
intervals can be found in Table [T] The branch lengths a, b, and c are confounded, so we show a 
plausible combination. (B) Map showing a sketch of possible directions of movement of 
ancestral populations. Colored arrows correspond to labeled branches in (A). 



29 



0.143 



0.178 



C. 777 



0.190 



0.246 



C.170 



-n 0.1 17 — Surui 



-o 0.126 — Karitiana 



p 0.170 — Naxi 
0.171 
= 0.171 — Yi 

> 0.168 — Japanese 
0.172 

-n 0.166 — Lahu 
a 0.167 — Dai 



p 0.238 — Mandenka 



fa 0.239 — Yoruba 



-a 0.144 — Papuan 



0.1 



_i i i i i i i i i_ 



Drift length (~2F ST ) 



B 



CD 
to 
CD 
C 

CD =S 

o-r 



111 



urara.-m-njcoraajo 
W ^ Z> ™ jQO.2' 



>- 



Surui 
Karitiana 
Naxi 



Japanese 



Lahu 



Papuan 
Mandenka 
Yoruba 




Surui 
Karitiana 
Naxi 



Japanese| 
Lahu 



Papuan 
Mandenka 
Yoruba 




10.35 

0.3 

0.25 

0.2 



0.24 
0.22 
0.2 
0.18 
0.16 
0.14 
lo.12 



Figure 5. Ancestral heterozygosity imputed from original Illumina vs. San-ascertained 
SNPs. (A) The 10-population unadmixed tree with estimated average heterozygosities using 
SNPs from Panel 4 (San ascertainment) of the Affymetrix Human Origins array (Patterson 



et al. 2012 ). Numbers in black are direct calculations for modern populations, while numbers 



in green are inferred values at ancestral nodes. (B, C) Computed ancestral heterozygosity at 
the common ancestor of each pair of modern populations. With unbiased data, values should 
be equal for pairs having the same common ancestor. (B) Values from a filtered subset of about 
250,000 SNPs from the published Illumina array data ( |Li et al4|200~8] ). (C) Values from the 
Human Origins array excluding SNPs in gene regions. 



30 



Tables 



Table 1. Mixture parameters for Europeans. 



Admixed Pop 


# rep 


Q 


BranchlLoc (Anc. N. Eurasian) 


Branch2Loc (Anc. W. Eurasian) 


MixedDrift 


Adygei 


500 


0.254-0.461 


0.033-0.078/0.195 


0.140-0.174/0.231 


0.077-0.092 


Basque 


464 


0.160-0.385 


0.053-0.143/0.196 


0.149-0.180/0.231 


0.105-0.121 


French 


491 


0.184-0.386 


0.054-0.130/0.195 


0.149-0.177/0.231 


0.089-0.104 


Italian 


497 


0.210-0.415 


0.043-0.108/0.195 


0.137-0.173/0.231 


0.092-0.109 


Orcadian 


442 


0.156-0.350 


0.068-0.164/0.195 


0.161-0.185/0.231 


0.096-0.113 


Russian 


500 


0.278-0.486 


0.045-0.091 / 0.195 


0.146-0.181 / 0.231 


0.079-0.095 


Sardinian 


480 


0.150-0.350 


0.045-0.121 / 0.195 


0.146-0.176/0.231 


0.107-0.123 


Tuscan 


489 


0.179-0.431 


0.039-0.118/0.195 


0.137-0.177/0.231 


0.088-0.110 



Mixture parameters from MixMapper for modern-day European populations (cf. Patterson 



et al. (2012)). All eight are nearly unanimously optimized as a mixture between populations 
related to the "ancient northern Eurasian" and "ancient western Eurasian" branches in the 
unadmixed tree (see Figure |4j^.). BranchlLoc and Branch2Loc are the points at which the 
mixing populations split from these branches; a is the proportion of ancestry from the "ancient 
northern Eurasian" side; MixedDrift is the sum of drift lengths a 2 a + (1 — a) 2 b + c; and # rep 
is the number of bootstrap replicates (out of 500) placing the mixture between these two 
branches. All ranges shown are 95% bootstrap confidence intervals. See Figure [2]\ for an 
illustration of the parameters. 
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Table 2. Mixture parameters for other populations modeled as two-way admixtures. 



Admixed Pop 


Branch 1 + Branch2 


# rep 


a 


Branch 1 Loc 


Branch2Loc 


MixedDrift 


Daur 


Anc. N. Eurasian + Japanese 


350 


0.067-0.276 


0.008-0.126/0.195 


0.006-0.013/0.016 


0.006-0.015 




Surui + Japanese 


112 


0.021-0.058 


0.008-0.177/0.177 


0.005-0.010/0.015 


0.005-0.016 


Hezhen 


Anc. N. Eurasian + Japanese 


411 


0.068-0.273 


0.006-0.113/0.195 


0.006-0.013/0.016 


0.005-0.029 


Oroqen 


Anc. N. Eurasian + Japanese 


410 


0.093-0.333 


0.017-0.133/0.195 


0.005-0.013/0.015 


0.011-0.030 




Karitiana + Japanese 


53 


0.025-0.086 


0.014-0.136/0.136 


0.004-0.008/0.016 


0.008-0.026 


Yakut 


Anc. N. Eurasian + Japanese 


481 


0.494-0.769 


0.005-0.026/0.195 


0.012-0.016/0.016 


0.030-0.041 


Melanesian 


Dai + Papuan 


424 


0.160-0.260 


0.008-0.014/0.014 


0.165-0.201 / 0.247 


0.089-0.114 




Lahu + Papuan 


54 


0.155-0.255 


0.003-0.032/0.032 


0.167-0.208/0.249 


0.081-0.114 


Han 


Dai + Japanese 


440 


0.349-0.690 


0.004-0.014/0.014 


0.008-0.016/0.016 


0.002-0.006 



Mixture parameters from MixMapper for non-European populations fit as two-way 
admixtures. See Figure]^, and the caption of Table[T]for descriptions of the parameters. 
Branch 1 and Branch2 are the optimal split points for the mixing populations; branch choices 
are shown that that occur for at least 50 of 500 bootstrap replicates. 
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Table 3. Mixture parameters for populations modeled as three-way admixtures. 



Admixed Pop2 


Branch3 


# rep 


2 


Branch3Loc 


MixedDrifHA 


FinalDriftIB 


MixedDrift2 


Druze 


Mandenka 


330 


0.963-0.988 


0.000-0.009/0.009 


0.081-0.099 


0.022-0.030 


0.004-0.013 




Yoruba 


82 


0.965-0.991 


0.000-0.010/0.010 


0.080-0.099 


0.022-0.029 


0.005-0.013 




Anc. W. Eurasian 


79 


0.881-0.966 


0.041-0.158/0.232 


0.092-0.118 


0.000-0.024 


0.010-0.031 


Palestinian 


Anc. W. Eurasian 


294 


0.818-0.901 


0.031-0.104/0.231 


0.093-0.123 


0.000-0.021 


0.007-0.022 




Mandenka 


146 


0.909-0.937 


0.000-0.009 / 0.009 


0.083-0.097 


0.022-0.029 


0.001-0.007 




Yoruba 


53 


0.911-0.938 


0.000-0.010/0.010 


0.077-0.098 


0.021-0.029 


0.001-0.008 


Bedouin 


Anc. W. Eurasian 


271 


0.767-0.873 


0.019-0.086/0.231 


0.094-0.122 


0.000-0.022 


0.012-0.031 




Mandenka 


176 


0.856-0.923 


0.000-0.008/0.008 


0.080-0.099 


0.023-0.030 


0.006-0.018 


Mozabite 


Mandenka 


254 


0.686-0.775 


0.000-0.009/0.009 


0.088-0.109 


0.012-0.022 


0.017-0.032 




Anc. W. Eurasian 


142 


0.608-0.722 


0.002-0.026/0.232 


0.103-0.122 


0.000-0.011 


0.018-0.035 




Yoruba 


73 


0.669-0.767 


0.000-0.008/0.010 


0.086-0.108 


0.012-0.023 


0.017-0.031 


Hazara 


Anc. East Asian 


497 


0.364-0.471 


0.010-0.024/0.034 


0.080-0.115 


0.004-0.034 


0.004-0.013 


Uygur 


Anc. East Asian 


500 


0.318-0.438 


0.007-0.023 / 0.034 


0.088-0.123 


0.000-0.027 


0.000-0.009 



Mixture parameters from MixMapper for populations fit as three-way admixtures. In all cases 
one parent population splits from the (admixed) Sardinian branch and the other from Branch3. 
See Figure [2j3 and the caption of Table [T] for further descriptions of the parameters. Branch 
choices are shown that that occur for at least 50 of 500 bootstrap replicates. The "Anc. East 
Asian" branch is the common ancestral branch of the five East Asian populations in the 
unadmixed tree (Dai, Japanese, Lahu, Naxi, and Yi). 
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Figure SI. Comparison of allele frequency spectra within and outside gene regions. We 

divided the Panel 4 (San-ascertained) SNPs into three groups: those outside gene regions 
(101,944), those within gene regions but not in exons (58,1 10), and those within coding 
regions (3259). Allele frequency spectra restricted to each group are shown for the Yoruba 
population. Reduced heterozygosity within exon regions is evident, which suggests the action 
of purifying selection. (Inset) We observe the same effect in the genie, non-coding spectrum; it 
is less noticeable but can be seen at the edge of the spectrum. 
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Drift parameter 



Figure S2. TreeMix results on the HGDR Admixture graph for HGDP populations obtained 
with the TreeMix software, as reported in Pickrell and Pritchard (2012). Figure is reproduced 
from |Pickrell and Pritchard ( 2012| ) with permission of the authors and under the Creative 
Commons Attribution License. 
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Figure S3. Comparison of f 2 distances computed using original Illumina vs. 
San-ascertained SNPs. The heat map shows the log fold change in f 2 values obtained from 



the original HGDP data ( |Li et al.[|2008| ) versus the San- ascertained data ( |Patterson et al.[|20T2| ) 
used in this study. 
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Figure S4. Schematic of part of an admixture tree. Population C is derived from an 
admixture of populations A and B with proportion a coming from A. The /2 distances from 
C to the present-day populations A',B',X', Y' give four relations from which we are able to 
infer four parameters: the mixture fraction a, the locations of the split points A" and B" (i.e. r 
and s), and the combined drift a 2 a + (1 — a) 2 b + c. 
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Table SI. Mixture proportions for Sardinian and Basque from / 4 ratio estimation. 



lesi pop. 


Asian pop. American pop. 


a 


Sardinian 


Dai 


Karitiana 


OO O 1 c o 

do.o ± D.O 


Sardinian 


Dai 


Suruf 




Sardinian 


Lahu 


Karitiana 


23.1 ± 7.0 


Sardinian 


Lahu 


Suruf 


24.7 ± 7.6 


Basque 


Dai 


Karitiana 


22.8 ± 7.0 


Basque 


Dai 


Suruf 


24.0 ± 7.6 


Basque 


Lahu 


Karitiana 


23.1 ± 7.4 


Basque 


Lahu 


Suruf 


24.7 ± 8.0 



To validate the mixture proportions estimated by MixMapper for Sardinian and Basque, we 
applied / 4 ratio estimation. The fraction a of "ancient northern Eurasian" ancestry was 
estimated as a = / 4 (Papuan, Asian; Yoruba, European) / / 4 (Papuan, Asian; Yoruba, 
American), where the European population is Sardinian or Basque, Asian is Dai or Lahu, and 
American is Karitiana or Suruf. Standard errors are from 500 bootstrap replicates. Note that 
this calculation assumes the topology of the ancestral mixing populations as inferred by 
MixMapper (Figure H|A). 
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Text SI. /-statistics and population admixture. 



Here we include derivations of the allele frequency divergence equations solved by MixMap- 
per to determine the optimal placement of admixed populations. These results were first pre- 
sented in Reich et al. ( |2009 ) and |Patterson et aL ( 2012| ), and we reproduce them here for com- 
pleteness, with slightly different emphasis and notation. We also describe in the final paragraph 
(and in more detail in Methods) how the structure of the equations leads to a particular form of 
the system for a full admixture tree. 



Our basic quantity of interest is the /-statistic / 2 , as defined in Reich et al. (2009), which is 
the squared allele frequency difference between two populations at a biallelic SNR That is, at 
SNP locus i, we define 

f 2 (A,B) := (pa-Pb) 2 , 

where pa is the frequency of one allele in population A and ps is the frequency of the allele in 
population B. This is the same as Nei's minimum genetic distance Dab for the case of a biallelic 



locus (Nei 1987). As in 



Reich et al. 



(2009), we define the unbiased estimator fz(A, B), which 



is a function of finite population samples: 



TlA - 1 



n B -1 



where, for each of A and B, p is the the empirical allele frequency and n is the total number of 
sampled alleles. 

We can also think of B) itself as the outcome of a random process of genetic history. 
In this context, we define 

F*(A,B):=E(( Pa -Pb) 2 ), 

the expectation of (pa — Pb) 2 as a function of population parameters. So, for example, if B is 
descended from A via one generation of Wright-Fisher genetic drift in a population of size N, 
then Fi,(A,B)=p A (l- Pa) /2N. 
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While fl (A, B) is unbiased, its variance may be large, so in practice, we use the statistic 

f 2 (A,B):=-Y,f2(AB), 



m 

1=1 



S2 



i.e., the average of f 2 (A, B) over a set of m SNPs. As we discuss in more detail in Text 
F^A^B) is not the same for different loci, meaning f 2 (A,B) will depend on the choice of 
SNPs. However, we do know that f 2 (A, B) is an unbiased estimator of the true average f 2 (A : B) 
of f 2 (A, B) over the set of SNPs. 

The utility of the f 2 statistic is due largely to the relative ease of deriving equations for its 
expectation between populations on an admixture tree. The following derivations are borrowed 



from ( |Reich et al.[|2009| ). As above, let the frequency of a SNP i in population X be p X - Then, 



for example, 

E(fi(A,B)) = E({p A -p B f 



E((pa -Pp + Pp- Pb) 2 ) 

E((pa - ppf) + E((p P - p B ) 2 ) + 2E((p A - p P )(p P - p B )) 
E(P 2 (A,P)) + E(r 2 (B,P)), 



since the genetic drifts Pa — Pp and p P — p B are uncorrected and have expectation 0. We can 
decompose these terms further; if Q is a population along the branch between A and P, then: 

E(f 2 (A,P)) = E((p A -p P ) 2 ) 

= E((p A - p Q + p Q - pp) 2 ) 

= E((p A - p Q ) 2 ) + E((p Q - pp) 2 ) + 2E((p A - p Q )(p Q - p P )) 

= EU l 2 \A,Q)) + EUl{Q,P))- 
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Here, again, E{p A - p Q ) = E(p Q - p P ) = 0, but p A - Pq and p Q - p P are not independent; 
for example, if Pq — pp = —pp, i.e. p Q = 0, then necessarily Pa — Pq = 0. However, p A — p Q 
and Pq — pp are independent conditional on a single value of pq, meaning the conditional 
expectation of (p A — Pq)(pq — Pp) is 0. By the double expectation theorem, 

E((pa-Pq)(pq-Pp)) = E(E((p A -p Q )(p Q -pp)\p Q )) = E(E(0)) = 0. 

From E(f*(A, P)) = E(fi(A, Q)) + E(fi(Q, P)), we can take the average over a set of SNPs 
to yield, in the notation from above, 



F 2 (A,P) = F 2 {A,Q)+F 2 (Q,P). 

We have thus shown that f 2 distances are additive along an unadmixed-drift tree. This 
property is fundamental for our theoretical results and is also essential for finding admixtures, 
since, as we will see, additivity does not hold for admixed populations. 

Given a set of populations with allele frequencies at a set of SNPs, we can use the esti- 
mator f 2 to compute f 2 distances between each pair. These distances should be additive if the 
populations are related as a true tree. Thus, it is natural to build a phylogeny using neighbor- 



joining (Saitou and Nei 1987), yielding a fully parameterized tree with all branch lengths in- 
ferred. However, in practice, the tree will not exactly be additive, and we may wish to try fitting 
some population C as an admixture. To do so, we would have to specify six parameters (in the 



notation of Figure S4): the locations on the tree of A" and B"; the branch lengths f 2 (A", A), 
f 2 (B", B), and f 2 (C, C); and the mixture fraction. These are the variables r, s, a, b, c, and a. 

In order to fit C onto an unadmixed tree (that is, solve for the six mixture parameters), we 
use the equations for the expectations F 2 (P, C) of the f 2 distances between C and each other 



population P in the tree. Referring to Figure S4 with the point admixture model, the allele 
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frequency in C is pc = a Pa + (1 — «) Pb- So, for a single locus, using additivity, 



E(f 2 (A',C')) = E{{p A ,-p c ,f) 

= E((p A , - p A n + p A n - p C + PC - PC') 2 ) 

= E{{p A> - p A „f) + E{{p A „ -a p A -{I -a) p B f) + E((p c - p c ,) 2 ) 
= E(r 2 (A>,A")) + a 2 E(tt(A",A)) 

+ (1 - a) 2 E(r 2 (A",B)) + E(r 2 (C,C')). 

Averaging over SNPs, and replacing E(f 2 (A', C')) by the estimator /^(A', C), this becomes 

f 2 (A',C) = F 2 (A',X")-r + a 2 a 

+ (1 - a) 2 (r + F 2 (X", Y") + s + b) + c 
=^ f 2 (A',C) - F 2 (A',X") = (a 2 -2a)r + (1 -a) 2 s + a 2 a 

+ (1 - a) 2 b + c + (1 - a) 2 F 2 (X", Y"). 

The quantities F 2 (X", Y") and F 2 (A', X") are constants that can be read off of the neighbor- 
joining tree. Similarly, we have 

f 2 (B', C) - F 2 (B', Y") = a 2 r + (a 2 - l)s + a 2 a + (1 - a) 2 b + c + a 2 F 2 (X", Y"). 

For the outgroups X' and Y', we have 

f 2 (X',C) = a 2 (c + a + r + F 2 (X',X")) 

+ (1 - a) 2 (c + b + s + F 2 {X", Y") + F 2 (X', X")) 
+2a(l-a) (c + F 2 (X',X")) 
= a 2 r + (l-a) 2 s + a 2 a + (l-a) 2 b + c 
+ (l-a) 2 F 2 (X // ,Y // ) + F 2 (X',X") 
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and 



/ 2 (y', C') = a 2 r + (1 - afs + a 2 a + (1 - afb + c + a 2 F 2 (X", Y") + F 2 (y', F"). 

Assuming additivity within the neighbor-joining tree, any population descended from A" 
will give the same equation (the first type), as will any population descended from B" (the 
second type), and any outgroup (the third type, up to a constant and a coefficient of a). Thus, 
no matter how many populations there are in the unadmixed tree — and assuming there are at 
least two outgroups X' and Y' such that the points X" and Y" are distinct — the system of 
equations consisting of E(f 2 (P,C')) for all P will contain precisely enough information to 
solve for a, r, s, and the linear combination a 2 a + (1 — a) 2 b + c. We also note the useful fact 
that for a fixed value of a, the system is linear in the remaining variables. 
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Text S2. Heterozygosity and drift lengths. 

One disadvantage to building trees with f 2 statistics is that the values are not in easily 
interpretable units. For a single locus, the / 2 statistic measures the squared allele frequency 
change between two populations. However, in practice, one needs to compute an average / 2 
value over many loci. Since the amount of drift per generation is proportional to p(l — p), 
the expected frequency change in a given time interval will be different for loci with different 
initial frequencies. This means that the estimator f 2 depends on the distribution of frequencies 
of the SNPs used to calculate it. For example, within an / 2 -based phylogeny, the lengths of 
non-adjacent edges are not directly comparable. 

In order to make use of the properties of f 2 statistics for admixture tree building and still 
be able to present our final trees in more directly meaningful units, we will show now how f 2 
distances can be converted into absolute drift lengths. Again, we consider a biallelic, neutral 
SNP in two populations, with no further mutations, under a Wright-Fisher model of genetic 
drift. 

Suppose populations A and B are descended independently from a population P, and we 
have an allele with frequency p in P, p A = p + a in A, and p B = p + b in B. The (true) 
heterozygosities at this locus are hp = 2p(l — p), h\ = 2p A (l — Pa), and h B = 2p B (l — Pb)- 
As above, we write h A for the unbiased single-locus estimator 



{n A -l)p A {l-p A y 

h A for the multi-locus average of h A , and H A for the expectation of h A under the Wright- Fisher 
model (and similarly for B and P). 

Say A has experienced t A generations of drift with effective population size N A since the 
split from P, and B has experienced t B generations of drift with effective population size N B . 
Then it is well known that H\ = h P {l - D A ), where D A — 1 — (1 — l/{2N A )) tA , and 
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H B = h P {l- D B ). We also have 



H\ = E(2(p + a){l-p-a)) 

= E(hp - 2ap + 2a- 2ap - 2a 2 ) 



h P - 2E(a 2 ) 



h P -2F*(A,P), 



so 2F*{A,P) = h P D A . Likewise, 2F%(B,P) = h P D B and 2F$(A,B) = h P (D A + D B ) 
Finally, 

H A + H B + 2F*(A, B) = h P {l - D A ) + h P (l - D B ) + h P (D A + D B ) = 2h l P . 



This equation is essentially equivalent to one in Nei (1987]), although Nei interprets his version 



as a way to calculate the expected present-day heterozygosity rather than estimate the ancestral 
heterozygosity. To our knowledge, the equation has not been applied in the past for this second 
purpose. 

In terms of allele frequencies, the form of hp turns out to be very simple: 

hp = Pa+Pb - 2p A p B = p A (l - p B ) + p B (l - p A ), 

which is the probability that two alleles, one sampled from A and one from B, are different 
by state. We can see, therefore, that this probability remains constant in expectation after any 
amount of drift in A and B. This fact is easily proved directly: 

E{p A +p B - Zp A p B ) = 2p - 2p 2 = h P , 
where we use the independence of drift in A and B. 
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Let hp := (h A + h B + 2f2 (A, B))/2, and let h p denote the true average heterozygosity in P 
over an entire set of SNPs. Since h p is an unbiased estimator of (h l A + h B + 2f\(A, B))/2, its 
expectation under the Wright-Fisher model is h P . So, the average hp of hp over a set of SNPs 
is an unbiased (and potentially low- variance) estimator of hp. If we have already constructed a 
phylogenetic tree using pairwise f 2 statistics, we can use the inferred branch length f2{A', P) 
from a present-day population A to an ancestor P in order to estimate hp more directly as 
hp = h A + 2/ 2 (A P)- This allows us, for example, to estimate heterozygosities at intermediate 
points along branches or in the ancestors of present-day admixed populations. 

The statistic hp is interesting in its own right, as it gives an unbiased estimate of the het- 
erozygosity in the common ancestor of any pair of populations (for a certain subset of the 
genome). For our purposes, though, it is most useful because we can form the quotient 

, 2f 2 (A,P) 

d A := f , 

hp 

where the f 2 statistic is inferred from a tree. This statistic d A is not exactly unbiased, but by the 
law of large numbers, if we use many SNPs, its expectation is very nearly 

* ^ E(2f 2 (A,P)) h P D A 

E{d A ) ~ -7— = —r = D A , 

E{h P ) hp 

where we use the fact that D A is the same for all loci. Thus d is a simple, direct, nearly unbiased 
moment estimator for the drift length between a population and one of its ancestors. This allows 
us to convert branch lengths from f 2 distances into absolute drift lengths, one branch at a time, 
by inferring ancestral heterozygosities and then dividing. 

For a terminal admixed branch leading to a present-day population C with heterozygosity 
ha, we divide twice the inferred mixed drift c\ = a 2 a + (1 — a) 2 b + c (Figure |2| by the 
heterozygosity h* c , := he + 2c\. This is only an approximate conversion, since it utilizes a 
common value h* c , for what are really three disjoint branches, but the error should be very small 
with short drifts. 
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An alternative definition of (La would be 1 — fiA/hp, which also has expectation (roughly) 
Da- In most cases, we prefer to use the definition in the previous paragraph, which allows 
us to leverage the greater robustness of the f 2 statistics, especially when taken from a multi- 
population tree. 

We note that this estimate of drift lengths is similar in spirit to the widely-used statistic F S t- 
For example, under proper conditions, the expectation of F ST among populations that have 



diverged under unadmixed drift is also 1 — (1 — 1/ {2N e )) t (Nei, 1987 ). When Fst is calculated 
for two populations at a biallelic locus using the formula (Ho — ^s)/^d, where is the 
probability two alleles from different populations are different by state and n 5 is the (average) 



probability two alleles from the same population are different by state (as in Reich et al. (2009) 



or the measure G' ST in 



Nei 



(1987)), then this Fst is exactly half of our d. As a general rule, 



drift lengths d are approximately twice as large as values of F ST reported elsewhere. 
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